1 Model Validation - “Advanced” vs “Basic”

  • Basic validation does not exist.
  • ‘Basic’ concepts:
    • Bias-variance trade-off
    • Decomposition of mean squared error (MSE)
    • Loss minimisation
    • Loss properties - efficiency, consistency
    • Scoring rules
  • An Introduction to Statistical Learning (James et al. 2013)
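
The decomposition of MSE listed above can be stated explicitly. For a regression estimate \(\hat{f}\) of the true function \(f\) at a point \(x\), with irreducible noise variance \(\sigma^2\), the expected squared error splits into bias, variance, and noise:

```latex
\mathbb{E}\big[(Y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The bias-variance trade-off follows directly: more flexible models reduce bias at the cost of variance, and vice versa.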

2 Outcomes - Model Results vs Model Performance Results

When a measure becomes a target, it ceases to be a good measure - Charles Goodhart/Marilyn Strathern (1997)

3 Machine Learning Framework

  • Any ‘classical’ model fits in the ML framework
  • Key difference is focus on fit/predict workflow (instead of inference and inspection)
  • External Validation: Classical models can be externally validated given only their reported parameters/coefficients. ML models additionally require stored model objects, random seeds, software versions, and hyper-parameter ranges.
  • Parsimony principle: If multiple models perform equally well, select the simplest. Particularly important in ML for black-box models.

3.1 Fit/Predict/Tune Workflow

3.1.1 The wrong way to learn

3.1.2 A better approach

3.2 A Good ML Framework

  1. Select a model, or multiple models
  2. Depending on data size, split with cross-validation, bootstrap, or hold-out
  3. Fit model(s) on training data
  4. Make predictions on test data
  5. Evaluate

Optionally: Split the training data again for model tuning. To tune, fit models on the ‘nested’ training data and select the hyper-parameter configuration that optimises performance on the nested test set.
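The five-step framework above can be sketched in a language-agnostic way. The following minimal Python illustration (a hand-rolled featureless mean predictor with a hold-out split; all names are illustrative, not part of any library) follows the same workflow:

```python
import random

def holdout_split(n, train_frac=2/3, seed=1):
    """Step 2: randomly split indices into training and testing sets."""
    random.seed(seed)
    idx = list(range(n))
    random.shuffle(idx)
    cut = int(train_frac * n)
    return idx[:cut], idx[cut:]

def fit_mean(y_train):
    """Step 3: 'fit' a featureless model -- just store the training mean."""
    return sum(y_train) / len(y_train)

def mse(y_true, y_pred):
    """Step 5: evaluate predictions with mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Step 1: the 'model' here is the featureless mean predictor above
y = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 24.4, 22.8, 19.2]  # toy target values
train, test = holdout_split(len(y))
model = fit_mean([y[i] for i in train])   # fit on training data only
preds = [model for _ in test]             # step 4: predict on held-out test data
print(mse([y[i] for i in test], preds))
```

The important discipline is structural: the test indices are never touched until step 4, regardless of how simple the model is.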

3.3 Application: The MVP

# 1. Pick a model
learn = lrn("regr.lm")
# 2. Load data
task = tsk("mtcars")
print(task)
## <TaskRegr:mtcars> (32 x 11)
## * Target: mpg
## * Properties: -
## * Features (10):
##   - dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt
task$head()
##     mpg am carb cyl disp drat gear  hp  qsec vs    wt
## 1: 21.0  1    4   6  160 3.90    4 110 16.46  0 2.620
## 2: 21.0  1    4   6  160 3.90    4 110 17.02  0 2.875
## 3: 22.8  1    1   4  108 3.85    4  93 18.61  1 2.320
## 4: 21.4  0    1   6  258 3.08    3 110 19.44  1 3.215
## 5: 18.7  0    2   8  360 3.15    3 175 17.02  0 3.440
## 6: 18.1  0    1   6  225 2.76    3 105 20.22  1 3.460
# 3. Split data with 2/3 hold-out split
train = sample(task$nrow, 2/3 * task$nrow)
test = setdiff(seq(task$nrow), train)
# 3. Fit model to data
learn$train(task, row_ids = train)
learn$model
## 
## Call:
## stats::lm(formula = task$formula(), data = task$data())
## 
## Coefficients:
## (Intercept)           am         carb          cyl         disp         drat  
##    15.51914      4.28928     -0.04438      0.40173      0.03062     -1.84221  
##        gear           hp         qsec           vs           wt  
##     1.47485     -0.04911      1.06293      2.13296     -5.71573
# 4. Make predictions on new data
pred = learn$predict(task, row_ids = test)
print(pred)
## <PredictionRegr> for 11 observations:
##     row_id truth response
##          3  22.8 27.57143
##          6  18.1 22.80786
##          8  24.4 22.74496
## ---                      
##         26  27.3 30.09445
##         28  30.4 30.56697
##         31  15.0 11.39905
# 5. Evaluate
pred$score()
## regr.mse 
## 15.91255

4 Why Validate?

Arguments against validation:

  • “All models are wrong” (Box 1976)
  • “…the super learner framework allows a researcher to try many prediction algorithms… knowing that the final combined super learner fit will either be the best fit or near the best fit” (Polley and Van Der Laan 2010)

Arguments for validation:

  • “…but some are useful” (Box 1976)
  • Parsimony Principle
  • Scientific Method

4.1 Aggregation vs. Linear Regression

gr = Graph$new()$
  add_pipeop(po("learner", lrn("regr.svm")))$
  add_pipeop(po("learner", lrn("regr.xgboost")))$
  add_pipeop(po("learner", lrn("regr.ranger")))$
  add_pipeop(po("regravg"))$
  add_edge("regr.svm", "regravg")$
  add_edge("regr.xgboost", "regravg")$
  add_edge("regr.ranger", "regravg")
plot(gr, html = TRUE)
learn_avg = GraphLearner$new(gr)

gr = Graph$new()$
  add_pipeop(po("learner", lrn("regr.svm")))$
  add_pipeop(po("learner", lrn("regr.xgboost")))$
  add_pipeop(po("learner", lrn("regr.ranger")))$
  add_pipeop(po("learner", lrn("regr.lm")))$
  add_pipeop(po("regravg"))$
  add_edge("regr.svm", "regravg")$
  add_edge("regr.xgboost", "regravg")$
  add_edge("regr.ranger", "regravg")$
  add_edge("regr.lm", "regravg")
plot(gr, html = TRUE)
learn_avg_lm = GraphLearner$new(gr)
learn_lm = lrn("regr.lm")
task = tsk("mtcars")
design = benchmark_grid(task, list(learn_avg, learn_avg_lm, learn_lm), rsmp("cv", folds = 5))
bm = benchmark(design)

The winner is…

bm$aggregate(msrs(c("regr.rmse", "regr.rmse_se")))[,c("learner_id", "regr.rmse", "regr.rmse_se")]
##                                           learner_id regr.rmse regr.rmse_se
## 1:         regr.svm.regr.xgboost.regr.ranger.regravg  5.657743    1.3362217
## 2: regr.svm.regr.xgboost.regr.ranger.regr.lm.regravg  4.461551    1.1802586
## 3:                                           regr.lm  3.295919    0.7509652

5 Internal and External Validation

  • Internal validation: Models are validated using the same dataset for training and testing.
  • External validation: Models are validated using an external dataset.

5.1 External Validation

External validation is the gold standard for evaluation: not only is the data completely unseen and new to the model, but the underlying data-generating process may even be different. If a model is trained and tested on the exact same data then there is no indication of how it may perform on new data; in general a model’s training performance does not correlate with its testing performance. Moreover, without external validation, there is no way to tell if your dataset is even representative, which can lead to strange results…

5.1.1 Case-Study: An Unimportant Example

The TOT model (Tan, Osman, and Tan 1997): \(Y = 88.1 + 0.51X\) where \(Y\) is ear circumference (mm) and \(X\) is age (years).

Their conclusion:

“In any event, this knowledge may be useful for forensic scientists and can be an alternative to dental records in the future. It may be speculated that if Van Gogh’s ear had been fossilized, even if it were to be found 2000 years from now, one could always calculate the age at which he lost his ear from self mutilation!”

External validation (Mateen and Sonabend 2019):

The model was tested on 100 academics at The Alan Turing Institute. The study concluded that always predicting someone’s ear circumference as the global median was as good as the TOT model.

The problem:

Their model was fit on 100 residents and staff from a single Texan old-age home, which is not really representative of the world’s population…

5.1.2 Case-Study: An Important Example

Many models have been proposed to predict the probability of death or disease progression from COVID-19. A rigorously tested model with significant predictive power could be used in hospitals and by governments to inform policy. On the other hand, a model that is only superficially well-performing could result in incorrect and possibly harmful decisions being made.

A recent (pre-print!!!) systematic review (Gupta et al. 2020) found that of 22 models that claimed to be predictive of COVID-19 mortality, none stood up to external validation and all were deemed unsuitable.

5.1.3 Unfortunately…

In the vast majority of experiments, external validation is simply not possible. This can be for a variety of reasons, but usually it is due to a lack of data. Ultimately researchers will (and should) prefer to use as much data as possible to fit a model. Furthermore, when data is collected for a specific experiment, there will genuinely be no external datasets to compare against until another study comes along and collects its own.

Luckily, procedures exist to estimate the results of external validation.

5.2 Internal Validation

Two types of error are discussed here: training and testing error. Training error is calculated by evaluating predictions on the exact same data used for fitting; testing error is calculated by evaluating predictions on unseen data after fitting. Training error is (almost) always over-optimistic, i.e. lower than the true prediction error (Hastie, Tibshirani, and Friedman 2001).

task = tsk("mtcars")
learn = lrn("regr.lm")
errors = replicate(100, {
  train = sample(task$nrow, 2/3 * task$nrow)
  test = setdiff(seq(task$nrow), train)
  train_err = as.numeric(learn$train(task)$predict(task)$score())
  test_err = as.numeric(learn$train(task, row_ids = train)$predict(task, row_ids = test)$score())
  return(list(train = train_err, test = test_err))
})
boxplot(unlist(errors[2,]))
abline(h = learn$train(task)$predict(task)$score(), col = 2)

5.2.1 Cross-Validation

True internal validation, i.e. where identical data is used for training and testing, is rarely used in practice; instead resampling methods are utilised. The simplest resampling method we have already seen above: ‘hold-out’. Hold-out randomly splits the data into a training set and a testing set; the usual split is 2/3 of the data for training and 1/3 for testing (Hastie, Tibshirani, and Friedman 2001). Whilst this is a simple procedure that allows the majority of data to be used for model fitting, it provides only a single, potentially high-variance estimate of the out-of-sample, or generalisation, error: the prediction error over an independent test sample. So-called ‘generalisation’ error as it provides a mechanism for determining how well the model ‘generalises’, i.e. performs on datasets other than the one used for training. \(K\)-fold cross-validation is a resampling procedure that estimates this error more reliably. The procedure is relatively simple and is displayed algorithmically below: the dataset is split into \(K\) ‘folds’, \(K-1\) of which are used for training while the \(K\)th is held out for testing. This is repeated \(K\) times so that every fold is used for testing at some stage. The predictions on the \(K\) folds are evaluated and aggregated into a single score.

Given a dataset \(d\):

  1. Split \(d\) into \(K\) roughly-equally sized datasets, \(\{d_1,...,d_K\}\)
  2. For \(i\) in \(1,...,K\), do:
    1. Train model on \(\{d_1,...,d_{K}\} \backslash \{d_i\}\)
    2. Make predictions for \(d_i\)
    3. Evaluate prediction to get a score \(L_i\)
  3. Prediction error is estimated as \(\hat{L} = \frac{1}{K} \sum_{i = 1}^K L_i\)
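The algorithm above can be sketched in plain Python (an illustrative generic fit/predict/score interface stands in for any learner; all names are hypothetical):

```python
import random

def kfold_cv(X, y, fit, predict, score, k=5, seed=1):
    """Estimate prediction error by K-fold cross-validation."""
    n = len(y)
    idx = list(range(n))
    random.seed(seed)
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # K roughly equally-sized folds
    losses = []
    for i in range(k):
        test = folds[i]                     # the i-th fold is held out
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = fit([X[j] for j in train], [y[j] for j in train])
        preds = predict(model, [X[j] for j in test])
        losses.append(score([y[j] for j in test], preds))
    return sum(losses) / k                  # aggregate into a single score

# Illustration: a featureless mean predictor scored with MSE
fit = lambda X, y: sum(y) / len(y)
predict = lambda model, X: [model] * len(X)
mse = lambda y_t, y_p: sum((t - p) ** 2 for t, p in zip(y_t, y_p)) / len(y_t)
err = kfold_cv(list(range(32)), [float(i % 7) for i in range(32)],
               fit, predict, mse)
```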
learn = lrn("regr.lm")
task = tsk("mtcars")
rr = resample(task, learn, rsmp("cv", folds = 5))
rr$score()[,c("learner_id", "iteration", "regr.mse")]
##    learner_id iteration  regr.mse
## 1:    regr.lm         1  6.354374
## 2:    regr.lm         2 43.089049
## 3:    regr.lm         3  8.415706
## 4:    regr.lm         4 12.899121
## 5:    regr.lm         5  7.237533
rr$aggregate()
## regr.mse 
## 15.59916
learn = lrn("regr.lm")
task = tsk("mtcars")
resamp = replicate(50, {
  score = c()
  for (i in c(5, 10)) {
    rr = resample(task, learn, rsmp("cv", folds = i))
    score = c(score, as.numeric(rr$aggregate()))
  }
  return(score)
})
loocv = as.numeric(resample(task, learn, rsmp("cv", folds = task$nrow))$aggregate())
plot(resamp[1,], type = "l", xlab = "Simulation", ylab = "MSE")
lines(resamp[2,], col = 2)
lines(rep(loocv, 50), col = 3)
legend("topright", lwd = 1, col = 1:3, legend = c(5, 10, 32))

6 Model Comparison and Significance Testing

The example at the start comparing linear regression to ensemble methods demonstrates the need for benchmark experiments. Benchmark experiments compare the performance of multiple models in order to determine which is the ‘best’. Unfortunately, even using a suitable resampling method to estimate prediction error is not enough to determine if one model is ‘better’ than another. Determining if one model ‘outperforms’ another comes down to comparison of standard errors and significance testing.

In practice, formal comparison between models is rarely observed, even in top journals (Király, Mateen, and Sonabend 2018). Yet such comparisons are vital to establishing whether one model outperforms another. Whilst allowances can be made for significance tests not always being available, confidence intervals are simple to calculate…

6.1 Standard Errors

In the simplest case, Normal approximations can be used to derive confidence intervals. As a standard caveat, when multiple experiments are performed, multiple testing correction should be applied to the constructed confidence intervals.
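As an illustrative sketch (plain Python; the function name is hypothetical), a 95% Normal-approximation interval for the mean of a vector of per-fold scores:

```python
import math

def normal_ci(scores, z=1.96):
    """Normal-approximation CI for the mean of resampling scores.

    z = 1.96 gives a 95% interval; widen z (e.g. via a Bonferroni
    correction) when constructing several intervals at once.
    """
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
    se = sd / math.sqrt(n)          # standard error of the mean
    return mean - z * se, mean + z * se

# e.g. five per-fold RMSE values for a single learner
lo, hi = normal_ci([6.35, 4.31, 8.42, 5.90, 7.24])
```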

learns = lrns(paste0("regr.", c("lm", "rpart")))
task = tsk("mtcars")
design = benchmark_grid(task, learns, rsmp("cv", folds = 5))
res = benchmark(design)$aggregate(msrs(c("regr.rmse", "regr.rmse_se")))
plot(res$regr.rmse, ylim = c(2, 7), xaxt = "n", ylab = "RMSE", xlab = "")
arrows(1:2, y0 = res$regr.rmse - (res$regr.rmse_se * 1.96), 
       y1 = res$regr.rmse + (res$regr.rmse_se * 1.96), code = 3, angle = 90)

6.2 Significance Testing

A slightly more rigorous method is to perform hypothesis testing on the results. One option is a Wilcoxon signed-rank test comparing the residuals across all folds. Effectively this tests whether the residuals produced by one model/measure show a significantly lower location shift than those of another.
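For illustration, a minimal sketch of the Wilcoxon signed-rank test with a Normal approximation (plain Python; it uses mid-ranks for ties and drops zero differences, but omits the continuity and tie-variance corrections found in production implementations):

```python
import math

def wilcoxon_signed_rank(a, b):
    """Two-sided p-value for zero location shift between paired samples."""
    d = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                  # mid-ranks for tied |d|
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        mid = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mu = n * (n + 1) / 4                          # mean of W+ under H0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

In a benchmark setting, `a` and `b` would be the per-observation (or per-fold) losses of two competing models; as with the confidence intervals above, the resulting p-values should be adjusted for multiple testing.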

learners <- list(makeLearner("regr.lm", id = "lm"),
                 makeLearner("regr.randomForest", id = "RF"),
                 makeLearner("regr.nnet", id = "NN"),
                 makeLearner("regr.featureless"))
bmresults <- benchmark(learners, bh.task, cv5, list(rmse, rmse.se, mae, mae.se))
test <- comparisontest(bmresults)[[1]][[1]]
matrix(p.adjust(test, method = "BH"), nrow = 4, ncol = 4, dimnames = dimnames(test))
##                  RF          lm           NN regr.featureless
## RF                1 3.68824e-05 1.477089e-10     1.477089e-10
## lm                1 1.00000e+00 4.676359e-07     4.676359e-07
## NN                1 1.00000e+00 1.000000e+00     1.000000e+00
## regr.featureless  1 1.00000e+00 9.821731e-02     1.000000e+00

7 Nested Validation and Tuning

A final complication is presented by model tuning. Tuning is the process of taking a set of possible configurations for model hyper-parameters, trying each configuration by fitting/predicting the model on data, and selecting the configuration with the best performance.

1) The wrong way

  1. Split data into training and testing
  2. Treat each configuration as a different model
  3. Benchmark ‘models’ in the ‘usual’ way
  4. Select best performing model
  5. Re-run benchmark comparing tuned model to other models

2) Still wrong

  1. Split data into training and testing
  2. Fit each configuration on the complete training data
  3. Make and evaluate predictions on the training data
  4. Select optimal configuration that minimises this error
  5. Fit model to complete training data and make predictions on test data

3) The correct method

  1. Split data into training and testing
  2. Split training data into training and testing
  3. Fit different model configurations on the ‘nested’ training data
  4. Make and evaluate predictions on the nested testing data
  5. Select optimal configuration and fit model on complete training data
  6. Make predictions on test data

Note:

  • Method 1) means that models are compared unfairly. A model has been optimised on a dataset, and then trained and tested on the exact same dataset!
  • Method 2) just calculates the training error for the tuning configurations, so the selected configuration is likely to generalise poorly.
  • Method 3) uses nested resampling to estimate the generalisation error. The inner resampling usually uses 2- or 3-fold CV, with 5 or 10 folds in the outer loop.

Ideally three datasets would be used: one for model tuning, one for training, and one for final predictions. In practice this is rarely possible, so nested cross-validation serves as a useful approximation.
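The correct method can be sketched in plain Python (illustrative names throughout; a nested hold-out split stands in for nested cross-validation to keep the sketch short):

```python
import random

def holdout(idx, frac=2/3, seed=1):
    """Random split of a list of indices into (train, test)."""
    random.seed(seed)
    idx = idx[:]
    random.shuffle(idx)
    cut = int(frac * len(idx))
    return idx[:cut], idx[cut:]

def tune_then_test(X, y, configs, fit, predict, score):
    """Select a configuration on nested data only, then evaluate the
    chosen model on the untouched outer test set."""
    outer_train, outer_test = holdout(list(range(len(y))), seed=1)
    inner_train, inner_test = holdout(outer_train, seed=2)  # nested split

    def err(train, test, cfg):
        model = fit([X[i] for i in train], [y[i] for i in train], cfg)
        preds = predict(model, [X[i] for i in test])
        return score([y[i] for i in test], preds)

    # choose the config that minimises error on the *nested* test set
    best = min(configs, key=lambda cfg: err(inner_train, inner_test, cfg))
    # refit on the complete training data, predict on the outer test set
    return best, err(outer_train, outer_test, best)

# Illustration: a mean predictor with a 'shrinkage' hyper-parameter
fit = lambda X, y, cfg: cfg * sum(y) / len(y)
predict = lambda model, X: [model] * len(X)
mse = lambda y_t, y_p: sum((t - p) ** 2 for t, p in zip(y_t, y_p)) / len(y_t)
```

The outer test indices appear only in the final evaluation, never during configuration selection, which is exactly what methods 1) and 2) get wrong.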

task = tsk("mtcars")
instance = TuningInstanceSingleCrit$new(
  task = task,
  learner = lrn("regr.svm", type = "nu-regression"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),
  search_space = ParamSet$new(list(ParamDbl$new("nu", lower = 0, upper = 1))),
  terminator = trm("evals", n_evals = 10)
)
tt = tnr("random_search")
tt$optimize(instance)
instance$result
##           nu learner_param_vals  x_domain regr.rmse
## 1: 0.4679156          <list[2]> <list[1]>  3.351495
svm_tuned = AutoTuner$new(
  learner = lrn("regr.svm", type = "nu-regression"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),
  search_space = ParamSet$new(list(ParamDbl$new("nu", lower = 0, upper = 1))),
  terminator = trm("evals", n_evals = 60),
  tuner = tnr("random_search")
)
lm = lrns(c("regr.lm", "regr.featureless"))
bm = benchmark(benchmark_grid(task, c(list(svm_tuned), lm), rsmp("cv", folds = 5)))
bm$aggregate(msrs(c("regr.rmse", "regr.rmse_se")))[,c("learner_id", "regr.rmse", "regr.rmse_se")]
##          learner_id regr.rmse regr.rmse_se
## 1:   regr.svm.tuned  3.479344    0.8513176
## 2:          regr.lm  3.150791    0.7443326
## 3: regr.featureless  5.940268    1.5540413

8 Summary

  • No such thing as basic validation
  • Model results are different from model performance results - Don’t optimise the results!
  • Validation is a basic requirement in any ML workflow.
  • Composite/ensemble models are not necessarily better than simpler ones.
  • Linear regression is your friend not your enemy.
  • Internal validation is good practice but be careful of incorrectly computing the training error instead of the test error.
  • Drawing conclusions about model performance requires both standard errors and/or significance testing; determining if a model is ‘good’ requires a baseline.
  • Model tuning requires nested validation.
  • To answer “What if the answers are wrong?”, reply “It’s the performance that matters”.

Software

  • mlr - Now in maintenance mode, used for the Wilcoxon test which will soon be in mlr3 (Bischl et al. 2016)
  • mlr3 - For model fitting and predicting (Lang, Binder, et al. 2019)
  • mlr3tuning - For model tuning (Lang, Richter, et al. 2019)
  • mlr3pipelines - For pipelines including bagging (Binder et al. 2019)
  • mlr3learners - For loading mlr3 learners (Lang et al. 2020)
  • paradox - For mlr3 parameter sets (Lang, Bischl, et al. 2019)

References

Binder, Martin, Florian Pfisterer, Bernd Bischl, Michel Lang, and Susanne Dandl. 2019. “mlr3pipelines: Preprocessing Operators and Pipelines for ’mlr3’.” CRAN.

Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. http://jmlr.org/papers/v17/15-066.html.

Box, George E. P. 1976. “Science and Statistics.” Journal of the American Statistical Association 71 (356): 791–799. http://links.jstor.org/sici?sici=0162-1459%28197612%2971%3A356%3C791%3ASAS%3E2.0.CO%3B2-W.

Gupta, Rishi K, Michael Marks, Thomas H. A. Samuels, Akish Luintel, Tommy Rampling, Humayra Chowdhury, Matteo Quartagno, et al. 2020. “Systematic Evaluation and External Validation of 22 Prognostic Models Among Hospitalised Adults with Covid-19: An Observational Cohort Study.” medRxiv. Cold Spring Harbor Laboratory Press. https://doi.org/10.1101/2020.07.24.20149815.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning. Springer New York Inc.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An introduction to statistical learning. Vol. 112. New York: Springer.

Király, Franz J, Bilal Mateen, and Raphael Sonabend. 2018. “NIPS - Not Even Wrong? A Systematic Review of Empirically Complete Demonstrations of Algorithmic Effectiveness in the Machine Learning and Artificial Intelligence Literature,” December. http://arxiv.org/abs/1812.07519.

Lang, Michel, Quay Au, Stefan Coors, and Patrick Schratz. 2020. “mlr3learners: Recommended Learners for ’mlr3’.” CRAN.

Lang, Michel, Martin Binder, Jakob Richter, Patrick Schratz, Florian Pfisterer, Stefan Coors, Quay Au, Giuseppe Casalicchio, Lars Kotthoff, and Bernd Bischl. 2019. “mlr3: A modern object-oriented machine learning framework in R.” Journal of Open Source Software 4 (44): 1903. https://doi.org/10.21105/joss.01903.

Lang, Michel, Bernd Bischl, Jakob Richter, Xudong Sun, and Martin Binder. 2019. “paradox: Define and Work with Parameter Spaces for Complex Algorithms.” CRAN. https://cran.r-project.org/package=paradox.

Lang, Michel, Jakob Richter, Bernd Bischl, and Daniel Schalk. 2019. “mlr3tuning: Tuning for ’mlr3’.” CRAN. https://cran.r-project.org/package=mlr3tuning.

Mateen, Bilal, and Raphael Sonabend. 2019. “All I want for Christmas is…Rigorous validation of predictive models to prevent hasty generalisations.” Significance 16 (6). John Wiley & Sons, Ltd (10.1111): 20–24. https://doi.org/10.1111/j.1740-9713.2019.01336.x.

Polley, Eric C, and Mark J Van Der Laan. 2010. “Super learner in prediction.” bepress.

Tan, R, V Osman, and G Tan. 1997. “Ear Size as a Predictor of Chronological Age.” Archives of Gerontology and Geriatrics 25 (2). Elsevier: 187–91.